Storytelling with Data

Posted on Thu 17 January 2019 in posts

The main components of storytelling with data are:

  1. Understanding and setting the context
  2. Exploring multiple angles
  3. Using compelling visualizations
  4. Using varied data sources
  5. Having a consistent narrative

Collecting data and finding supplementary data

Let's take a look at SAT score data in New York City. NYC OpenData provides several different datasets on their website for the public to use. The core of this project will be based around two datasets: SAT Results and the NYC High School Directory.

Using only these two datasets may be enough to produce a basic analysis. However, in order to provide more depth to the topic and to enhance the analysis, we should look to supplement the core data with other relevant datasets. Incorporating a multitude of data to support a compelling analysis will help drive home the eventual results.

In this case, there are several related data sets we can use from the website that cover demographic information and test scores.

We will use the following datasets in this analysis:

These datasets are interrelated and therefore we can combine them before doing any analysis.

Getting background information

Before diving into analyzing the data, it's useful to research some background information. In this case, we know a few facts that will be useful:

  • New York City is divided into 5 boroughs, which are essentially distinct regions.
  • Schools in New York City are divided into several school district, each of which can contains dozens of schools.
  • Not all the schools in all of the datasets are high schools, so we'll need to do some data cleaning.
  • Each school in New York City has a unique code called a DBN, or District Borough Number.
  • By aggregating data by district, we can use the district mapping data to plot district-by-district differences.

Exploratory data analysis

The first step to understand the data is to do some exploratory data analysis, so let's examine the datasets we've gathered above.

NYC OpenData makes most of their data sets available via API on their website. The data is available in both JSON and CSV format. For example, the data for SAT results is available in CSV format from this link. For the purposes of this analysis, we will pull from the API in CSV format.

First, let's create a dictionary to store each data set and API link. Then we can loop through that dictionary to read each file into a pandas DataFrame.

In [1]:
import pandas as pd
import numpy as np

files = {'sat_results':'sat_results.csv',
        'hs_directory':'hs_directory.csv',
        'math_results':'math_results.csv',
        'class_size':'class_size.csv',
        'ap_results':'ap_results.csv',
        'grad_outcomes':'grad_outcomes.csv',
        'demographics':'demographics.csv'
        }

# Put data from each link into a DataFrame then store the data in
# a new dictionary called 'data'
data = {}
for key, value in files.items():
    d = pd.read_csv("/Users/swagbird/Desktop/Data Science/Projects/"\
                    "Dataquest Portfolio Series/Storytelling with Data"\
                    "/data/{0}".format(value))
    data[key] = d
In [2]:
for k,v in data.items():
    print(v.shape)
(478, 6)
(435, 64)
(28478, 16)
(27611, 16)
(258, 5)
(25096, 23)
(10075, 38)
In [3]:
# Print columns and the first 5 lines of each DataFrame in our dictionary
for k,v in data.items():
    print("\n" + k + "\n")
    print(v.columns)
    print(v.head())
sat_results

Index(['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers',
       'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score',
       'SAT Writing Avg. Score'],
      dtype='object')
      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404   
1                     91                             383                 423   
2                     70                             377                 402   
3                      7                             414                 401   
4                     44                             390                 433   

  SAT Writing Avg. Score  
0                    363  
1                    366  
2                    370  
3                    359  
4                    384  

hs_directory

Index(['dbn', 'school_name', 'borough', 'building_code', 'phone_number',
       'fax_number', 'grade_span_min', 'grade_span_max', 'expgrade_span_min',
       'expgrade_span_max', 'bus', 'subway', 'primary_address_line_1', 'city',
       'state_code', 'postcode', 'website', 'total_students', 'campus_name',
       'school_type', 'overview_paragraph', 'program_highlights',
       'language_classes', 'advancedplacement_courses', 'online_ap_courses',
       'online_language_courses', 'extracurricular_activities',
       'psal_sports_boys', 'psal_sports_girls', 'psal_sports_coed',
       'school_sports', 'partner_cbo', 'partner_hospital', 'partner_highered',
       'partner_cultural', 'partner_nonprofit', 'partner_corporate',
       'partner_financial', 'partner_other', 'addtl_info1', 'addtl_info2',
       'start_time', 'end_time', 'se_services', 'ell_programs',
       'school_accessibility_description', 'number_programs', 'priority01',
       'priority02', 'priority03', 'priority04', 'priority05', 'priority06',
       'priority07', 'priority08', 'priority09', 'priority10', 'Location 1',
       'Community Board', 'Council District', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')
      dbn                                        school_name    borough  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number  grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262             9.0              12   
1          X400    718-842-0687  718-589-9849             9.0              12   
2          X240    718-294-8111  718-294-8109             6.0              12   
3          M520  718-935-3477             NaN             9.0              10   
4          Q695    718-969-3155  718-969-3552             6.0              12   

   expgrade_span_min  expgrade_span_max  \
0                NaN                NaN   
1                NaN                NaN   
2                NaN                NaN   
3                9.0               14.0   
4                NaN                NaN   

                         ...                         priority08 priority09  \
0                        ...                                NaN        NaN   
1                        ...                                NaN        NaN   
2                        ...                                NaN        NaN   
3                        ...                                NaN        NaN   
4                        ...                                NaN        NaN   

  priority10                                         Location 1  \
0        NaN  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...   
1        NaN  1110 Boston Road\nBronx, NY 10456\n(40.8276026...   
2        NaN  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...   
3        NaN  411 Pearl Street\nNew York, NY 10038\n(40.7106...   
4        NaN  160 20 Goethals Avenue\nJamaica, NY 11432\n(40...   

  Community Board  Council District Census Tract        BIN           BBL  \
0             9.0              35.0        213.0  3029686.0  3.011870e+09   
1             3.0              16.0        135.0  2004526.0  2.026340e+09   
2             4.0              14.0        209.0  2008336.0  2.028590e+09   
3             1.0               1.0         29.0  1001388.0  1.001130e+09   
4             8.0              24.0       1267.0  4539721.0  4.068580e+09   

                                                 NTA  
0  Crown Heights South                           ...  
1  Morrisania-Melrose                            ...  
2  West Concourse                                ...  
3  Chinatown                                     ...  
4  Pomonok-Flushing Heights-Hillcrest            ...  

[5 rows x 64 columns]

math_results

Index(['DBN', 'Grade', 'Year', 'Category', 'Number Tested', 'Mean Scale Score',
       'Level 1 #', 'Level 1 %', 'Level 2 #', 'Level 2 %', 'Level 3 #',
       'Level 3 %', 'Level 4 #', 'Level 4 %', 'Level 3+4 #', 'Level 3+4 %'],
      dtype='object')
      DBN Grade  Year      Category  Number Tested Mean Scale Score Level 1 #  \
0  01M015     3  2006  All Students             39              667         2   
1  01M015     3  2007  All Students             31              672         2   
2  01M015     3  2008  All Students             37              668         0   
3  01M015     3  2009  All Students             33              668         0   
4  01M015     3  2010  All Students             26              677         6   

  Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 %  \
0       5.1        11      28.2        20      51.3         6      15.4   
1       6.5         3       9.7        22        71         4      12.9   
2         0         6      16.2        29      78.4         2       5.4   
3         0         4      12.1        28      84.8         1         3   
4      23.1        12      46.2         6      23.1         2       7.7   

  Level 3+4 # Level 3+4 %  
0          26        66.7  
1          26        83.9  
2          31        83.8  
3          29        87.9  
4           8        30.8  

class_size

Index(['CSD', 'BOROUGH', 'SCHOOL CODE', 'SCHOOL NAME', 'GRADE ',
       'PROGRAM TYPE', 'CORE SUBJECT (MS CORE and 9-12 ONLY)',
       'CORE COURSE (MS CORE and 9-12 ONLY)', 'SERVICE CATEGORY(K-9* ONLY)',
       'NUMBER OF STUDENTS / SEATS FILLED', 'NUMBER OF SECTIONS',
       'AVERAGE CLASS SIZE', 'SIZE OF SMALLEST CLASS', 'SIZE OF LARGEST CLASS',
       'DATA SOURCE', 'SCHOOLWIDE PUPIL-TEACHER RATIO'],
      dtype='object')
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT   
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED   
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT   
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED   

  CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY)  \
0                                    -                                   -   
1                                    -                                   -   
2                                    -                                   -   
3                                    -                                   -   
4                                    -                                   -   

  SERVICE CATEGORY(K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  \
0                           -                               19.0   
1                           -                               21.0   
2                           -                               17.0   
3                           -                               17.0   
4                           -                               15.0   

   NUMBER OF SECTIONS  AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  \
0                 1.0                19.0                    19.0   
1                 1.0                21.0                    21.0   
2                 1.0                17.0                    17.0   
3                 1.0                17.0                    17.0   
4                 1.0                15.0                    15.0   

   SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO  
0                   19.0         ATS                             NaN  
1                   21.0         ATS                             NaN  
2                   17.0         ATS                             NaN  
3                   17.0         ATS                             NaN  
4                   15.0         ATS                             NaN  

ap_results

Index(['DBN', 'SchoolName', 'AP Test Takers ', 'Total Exams Taken',
       'Number of Exams with scores 3 4 or 5'],
      dtype='object')
      DBN                             SchoolName  AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.             39.0   
1  01M450                 EAST SIDE COMMUNITY HS             19.0   
2  01M515                    LOWER EASTSIDE PREP             24.0   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH            255.0   
4  02M296  High School of Hospitality Management              NaN   

   Total Exams Taken  Number of Exams with scores 3 4 or 5  
0               49.0                                  10.0  
1               21.0                                   NaN  
2               26.0                                  24.0  
3              377.0                                 191.0  
4                NaN                                   NaN  

grad_outcomes

Index(['Demographic', 'DBN', 'School Name', 'Cohort', 'Total Cohort',
       'Total Grads - n', 'Total Grads - % of cohort', 'Total Regents - n',
       'Total Regents - % of cohort', 'Total Regents - % of grads',
       'Advanced Regents - n', 'Advanced Regents - % of cohort',
       'Advanced Regents - % of grads', 'Regents w/o Advanced - n',
       'Regents w/o Advanced - % of cohort',
       'Regents w/o Advanced - % of grads', 'Local - n', 'Local - % of cohort',
       'Local - % of grads', 'Still Enrolled - n',
       'Still Enrolled - % of cohort', 'Dropped Out - n',
       'Dropped Out - % of cohort'],
      dtype='object')
    Demographic     DBN                            School Name    Cohort  \
0  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2003   
1  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2004   
2  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2005   
3  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2006   
4  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL  2006 Aug   

   Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
0             5               s                         s                 s   
1            55              37                      67.3                17   
2            64              43                      67.2                27   
3            78              43                      55.1                36   
4            78              44                      56.4                37   

  Total Regents - % of cohort Total Regents - % of grads  \
0                           s                          s   
1                        30.9                       45.9   
2                        42.2                       62.8   
3                        46.2                       83.7   
4                        47.4                       84.1   

             ...            Regents w/o Advanced - n  \
0            ...                                   s   
1            ...                                  17   
2            ...                                  27   
3            ...                                  36   
4            ...                                  37   

  Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads  \
0                                  s                                 s   
1                               30.9                              45.9   
2                               42.2                              62.8   
3                               46.2                              83.7   
4                               47.4                              84.1   

  Local - n Local - % of cohort  Local - % of grads Still Enrolled - n  \
0         s                   s                   s                  s   
1        20                36.4                54.1                 15   
2        16                  25  37.200000000000003                  9   
3         7                   9                16.3                 16   
4         7                   9                15.9                 15   

  Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort  
0                            s               s                         s  
1                         27.3               3                       5.5  
2                         14.1               9                      14.1  
3                         20.5              11                      14.1  
4                         19.2              11                      14.1  

[5 rows x 23 columns]

demographics

Index(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent',
       'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4',
       'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11',
       'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent',
       'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num',
       'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per',
       'male_num', 'male_per', 'female_num', 'female_per'],
      dtype='object')
      DBN                       Name  schoolyear fl_percent  frl_percent  \
0  01M015  P.S. 015 ROBERTO CLEMENTE    20052006       89.4          NaN   
1  01M015  P.S. 015 ROBERTO CLEMENTE    20062007       89.4          NaN   
2  01M015  P.S. 015 ROBERTO CLEMENTE    20072008       89.4          NaN   
3  01M015  P.S. 015 ROBERTO CLEMENTE    20082009       89.4          NaN   
4  01M015  P.S. 015 ROBERTO CLEMENTE    20092010                    96.5   

   total_enrollment prek   k grade1 grade2    ...     black_num black_per  \
0               281   15  36     40     33    ...            74      26.3   
1               243   15  29     39     38    ...            68      28.0   
2               261   18  43     39     36    ...            77      29.5   
3               252   17  37     44     32    ...            75      29.8   
4               208   16  40     28     32    ...            67      32.2   

  hispanic_num hispanic_per white_num white_per male_num male_per female_num  \
0          189         67.3         5       1.8    158.0     56.2      123.0   
1          153         63.0         4       1.6    140.0     57.6      103.0   
2          157         60.2         7       2.7    143.0     54.8      118.0   
3          149         59.1         7       2.8    149.0     59.1      103.0   
4          118         56.7         6       2.9    124.0     59.6       84.0   

  female_per  
0       43.8  
1       42.4  
2       45.2  
3       40.9  
4       40.4  

[5 rows x 38 columns]

Unifying the data

Now that we have all our datasets imported into DataFrames. Let's try to unify them all into one. Looks like most of the datasets have a DBN column. DBN serves as a unique identifier for each school and it looks like it's present in most of our datasets. Let's do a quick check:

In [4]:
# Print columns and the first 5 lines of each DataFrame in our dictionary
for k,v in data.items():
    if 'DBN' in v.columns:
        print(k)
sat_results
math_results
ap_results
grad_outcomes
demographics

DBN is missing from two datasets: hs_directory and class_size. In hs_directory the DBN column is named dbn, so we just need to rename that column. Let's find out how we can link class_size back to the other six datasets.

The DBN column looks like this:

In [5]:
data['demographics'].DBN.head()
Out[5]:
0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

What does the class_size dataset look like?

In [6]:
data['class_size'].head()
Out[6]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN

Comparing the DBN column to the class_size data, it looks like DBN is a combination of the first three columns: CSD, BOROUGH, and SCHOOL CODE. DBN stands for "District Borough Number", so it seems that CSD is the "District", BOROUGH is the "Borough", and SCHOOL CODE is the "Number". It looks like BOROUGH is already contained within SCHOOL CODE, and therefore, DBN can be formulated by combining CSD with SCHOOL CODE.

Let's add the DBN column to class_size and rename it in hs_directory:

In [7]:
data['class_size']["DBN"] = data['class_size'].apply(lambda x: '{0:02d}{1}'.format(x['CSD'], x['SCHOOL CODE']), axis=1)
In [8]:
data['class_size'].head(1)
Out[8]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO DBN
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN 01M015
In [9]:
data['hs_directory'].rename(columns={'dbn':'DBN'}, inplace=True)
In [10]:
data['hs_directory'].head(1)
Out[10]:
DBN school_name borough building_code phone_number fax_number grade_span_min grade_span_max expgrade_span_min expgrade_span_max ... priority08 priority09 priority10 Location 1 Community Board Council District Census Tract BIN BBL NTA
0 17K548 Brooklyn School for Music & Theatre Brooklyn K440 718-230-6250 718-230-6262 9.0 12 NaN NaN ... NaN NaN NaN 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... 9.0 35.0 213.0 3029686.0 3.011870e+09 Crown Heights South ...

1 rows × 64 columns

Including survey data

The survey data is split up between two files: data for all schools and data for school district 75. We will create a flag for district 75 schools, and then we will concatenate these two datasets into one combined DataFrame. Afterwards, we can add this new dataset to our data dictionary.

In [11]:
survey_all = pd.read_csv('/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/survey_all.csv', skiprows=2, encoding='windows-1252')
survey_d75 = pd.read_csv('/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/survey_d75.csv', skiprows=2, encoding='windows-1252')
survey_all['d75'] = False
survey_d75['d75'] = True
survey = pd.concat([survey_all, survey_d75], axis=0, sort = False)
In [12]:
survey.head()
Out[12]:
dbn schoolname rr_s rr_t rr_p N_s N_t N_p nr_s nr_t ... s_q14_2 s_q14_3 s_q14_4 s_q14_5 s_q14_6 s_q14_7 s_q14_8 s_q14_9 s_q14_10 s_q14_11
0 01M015 P.S. 015 Roberto Clemente NaN 88 60 NaN 22.0 90.0 0 25 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 01M019 P.S. 019 Asher Levy NaN 100 60 NaN 34.0 161.0 0 33 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 01M020 P.S. 020 Anna Silver NaN 88 73 NaN 42.0 367.0 0 48 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 01M034 P.S. 034 Franklin D. Roosevelt 89.0 73 50 145.0 29.0 151.0 163 40 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 01M063 P.S. 063 William McKinley NaN 100 60 NaN 23.0 90.0 0 23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 2769 columns

Let's remove unnecessary columns within our new survey dataset.

In [13]:
# Rename DBN column and filter only for important columns in dataset
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s",
                 "N_t", "N_p", "saf_p_11", "com_p_11",
                 "eng_p_11", "aca_p_11", "saf_t_11",
                 "com_t_11", "eng_t_11", "aca_t_11",
                 "saf_s_11", "com_s_11", "eng_s_11",
                 "aca_s_11", "saf_tot_11", "com_tot_11",
                 "eng_tot_11", "aca_tot_11",]
survey = survey.loc[:,survey_fields]
survey.shape
Out[13]:
(1702, 23)
In [14]:
# Add survey dataset to our dictionary of datasets
data["survey"] = survey
data["survey"].head()
Out[14]:
DBN rr_s rr_t rr_p N_s N_t N_p saf_p_11 com_p_11 eng_p_11 ... eng_t_11 aca_t_11 saf_s_11 com_s_11 eng_s_11 aca_s_11 saf_tot_11 com_tot_11 eng_tot_11 aca_tot_11
0 01M015 NaN 88 60 NaN 22.0 90.0 8.5 7.6 7.5 ... 7.6 7.9 NaN NaN NaN NaN 8.0 7.7 7.5 7.9
1 01M019 NaN 100 60 NaN 34.0 161.0 8.4 7.6 7.6 ... 8.9 9.1 NaN NaN NaN NaN 8.5 8.1 8.2 8.4
2 01M020 NaN 88 73 NaN 42.0 367.0 8.9 8.3 8.3 ... 6.8 7.5 NaN NaN NaN NaN 8.2 7.3 7.5 8.0
3 01M034 89.0 73 50 145.0 29.0 151.0 8.8 8.2 8.0 ... 6.8 7.8 6.2 5.9 6.5 7.4 7.3 6.7 7.1 7.9
4 01M063 NaN 100 60 NaN 23.0 90.0 8.7 7.9 8.1 ... 7.8 8.1 NaN NaN NaN NaN 8.5 7.6 7.9 8.0

5 rows × 23 columns

Condensing datasets

In order to combine all of the datasets, there can only be one unique DBN value for each row. Taking a look at the class_size dataset, we see that this is not the case.

In [15]:
data['class_size']['GRADE '].unique()
Out[15]:
array(['0K', '01', '02', '03', '04', '05', '0K-09', nan, '06', '07', '08',
       'MS Core', '09-12', '09'], dtype=object)

To condense the class_size dataset, we can filter out values in the GRADE (we only want grades 9-12) and PROGRAM TYPE (we only want general education) columns. Then, we can group by DBN and calculate for the average class_size values for each school.

In [16]:
class_size = data['class_size']
class_size = class_size[class_size['GRADE '] == '09-12']
class_size = class_size[class_size['PROGRAM TYPE'] == 'GEN ED']
class_size = class_size.groupby('DBN').agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

Next, we will condense the demographics dataset. There is data for multiple years for the same schools, so let's filter by the most recent year available.

In [17]:
data['demographics'] = data['demographics'][data['demographics']['schoolyear'] == 20112012]
In [18]:
data['demographics'].head()
Out[18]:
DBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
6 01M015 P.S. 015 ROBERTO CLEMENTE 20112012 NaN 89.4 189 13 31 35 28 ... 63 33.3 109 57.7 4 2.1 97.0 51.3 92.0 48.7
13 01M019 P.S. 019 ASHER LEVY 20112012 NaN 61.5 328 32 46 52 54 ... 81 24.7 158 48.2 28 8.5 147.0 44.8 181.0 55.2
20 01M020 PS 020 ANNA SILVER 20112012 NaN 92.5 626 52 102 121 87 ... 55 8.8 357 57.0 16 2.6 330.0 52.7 296.0 47.3
27 01M034 PS 034 FRANKLIN D ROOSEVELT 20112012 NaN 99.7 401 14 34 38 36 ... 90 22.4 275 68.6 8 2.0 204.0 50.9 197.0 49.1
35 01M063 PS 063 WILLIAM MCKINLEY 20112012 NaN 78.9 176 18 20 30 21 ... 41 23.3 110 62.5 15 8.5 97.0 55.1 79.0 44.9

5 rows × 38 columns

We will need to do the same for math_results. This data is segmented by Grade and Year. Let's choose the highest grade and the latest year.

In [19]:
data["math_results"] = data["math_results"][data["math_results"]["Year"] == 2011]
data["math_results"] = data["math_results"][data["math_results"]["Grade"] == '8']
In [20]:
data['math_results'].head()
Out[20]:
DBN Grade Year Category Number Tested Mean Scale Score Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 # Level 4 % Level 3+4 # Level 3+4 %
111 01M034 8 2011 All Students 48 646 15 31.3 22 45.8 11 22.9 0 0 11 22.9
280 01M140 8 2011 All Students 61 665 1 1.6 43 70.5 17 27.9 0 0 17 27.9
346 01M184 8 2011 All Students 49 727 0 0 0 0 5 10.2 44 89.8 49 100
388 01M188 8 2011 All Students 49 658 10 20.4 26 53.1 10 20.4 3 6.1 13 26.5
411 01M292 8 2011 All Students 49 650 15 30.6 25 51 7 14.3 2 4.1 9 18.4

grad outcomes is the last dataset that needs to be condensed:

In [21]:
data['grad_outcomes'] = data['grad_outcomes'][data['grad_outcomes']['Demographic'] == 'Total Cohort']
data['grad_outcomes'] = data['grad_outcomes'][data['grad_outcomes']['Cohort'] == '2006']
In [22]:
data['grad_outcomes'].head()
Out[22]:
Demographic DBN School Name Cohort Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n Total Regents - % of cohort Total Regents - % of grads ... Regents w/o Advanced - n Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads Local - n Local - % of cohort Local - % of grads Still Enrolled - n Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort
3 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2006 78 43 55.1 36 46.2 83.7 ... 36 46.2 83.7 7 9 16.3 16 20.5 11 14.1
10 Total Cohort 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 2006 124 53 42.7 42 33.9 79.2 ... 34 27.4 64.2 11 8.9 20.8 46 37.1 20 16.100000000000001
17 Total Cohort 01M450 EAST SIDE COMMUNITY SCHOOL 2006 90 70 77.8 67 74.400000000000006 95.7 ... 67 74.400000000000006 95.7 3 3.3 4.3 15 16.7 5 5.6
24 Total Cohort 01M509 MARTA VALLE HIGH SCHOOL 2006 84 47 56 40 47.6 85.1 ... 23 27.4 48.9 7 8.300000000000001 14.9 25 29.8 5 6
31 Total Cohort 01M515 LOWER EAST SIDE PREPARATORY HIGH SCHO 2006 193 105 54.4 91 47.2 86.7 ... 22 11.4 21 14 7.3 13.3 53 27.5 35 18.100000000000001

5 rows × 23 columns

Computing variables

Let's compute the total SAT score by adding the values from the individual columns for Critical Reading, Math, and Writing. First, we need to convert the values in these columns from strings to numbers before we can add them together.

In [23]:
cols = ['SAT Critical Reading Avg. Score',
        'SAT Math Avg. Score', 'SAT Writing Avg. Score']

for c in cols:
    data['sat_results'][c] = pd.to_numeric(data['sat_results'][c], errors='coerce')

data['sat_results']['total_sat_score'] = data['sat_results'][[cols[0], cols[1], cols[2]]].sum(axis=1) 
In [24]:
# Drop rows for schools with no SAT results
print(data['sat_results'].shape)
data['sat_results'].dropna(inplace=True)
print(data['sat_results'].shape)
(478, 7)
(421, 7)

To plot the locations of each school on a map, we need to parse out the coordinate locations for each school.

In [25]:
data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(lambda x: float(x.split("\n")[2].strip('()').split(',')[0]))
data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(lambda x: float(x.split("\n")[2].strip('()').split(',')[1]))
In [26]:
data['hs_directory'][['lat', 'lon']].head()
Out[26]:
lat lon
0 40.670299 -73.961648
1 40.827603 -73.904475
2 40.842414 -73.916162
3 40.710679 -74.000807
4 40.718810 -73.806500
In [27]:
for k, v in data.items():
    print(k)
    print(v.head())
sat_results
      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers  SAT Critical Reading Avg. Score  \
0                     29                            355.0   
1                     91                            383.0   
2                     70                            377.0   
3                      7                            414.0   
4                     44                            390.0   

   SAT Math Avg. Score  SAT Writing Avg. Score  total_sat_score  
0                404.0                   363.0           1122.0  
1                423.0                   366.0           1172.0  
2                402.0                   370.0           1149.0  
3                401.0                   359.0           1174.0  
4                433.0                   384.0           1207.0  
hs_directory
      DBN                                        school_name    borough  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number  grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262             9.0              12   
1          X400    718-842-0687  718-589-9849             9.0              12   
2          X240    718-294-8111  718-294-8109             6.0              12   
3          M520  718-935-3477             NaN             9.0              10   
4          Q695    718-969-3155  718-969-3552             6.0              12   

   expgrade_span_min  expgrade_span_max    ...     priority10  \
0                NaN                NaN    ...            NaN   
1                NaN                NaN    ...            NaN   
2                NaN                NaN    ...            NaN   
3                9.0               14.0    ...            NaN   
4                NaN                NaN    ...            NaN   

                                          Location 1 Community Board  \
0  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...             9.0   
1  1110 Boston Road\nBronx, NY 10456\n(40.8276026...             3.0   
2  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...             4.0   
3  411 Pearl Street\nNew York, NY 10038\n(40.7106...             1.0   
4  160 20 Goethals Avenue\nJamaica, NY 11432\n(40...             8.0   

  Council District Census Tract        BIN           BBL  \
0             35.0        213.0  3029686.0  3.011870e+09   
1             16.0        135.0  2004526.0  2.026340e+09   
2             14.0        209.0  2008336.0  2.028590e+09   
3              1.0         29.0  1001388.0  1.001130e+09   
4             24.0       1267.0  4539721.0  4.068580e+09   

                                                 NTA        lat        lon  
0  Crown Heights South                           ...  40.670299 -73.961648  
1  Morrisania-Melrose                            ...  40.827603 -73.904475  
2  West Concourse                                ...  40.842414 -73.916162  
3  Chinatown                                     ...  40.710679 -74.000807  
4  Pomonok-Flushing Heights-Hillcrest            ...  40.718810 -73.806500  

[5 rows x 66 columns]
math_results
        DBN Grade  Year      Category  Number Tested Mean Scale Score  \
111  01M034     8  2011  All Students             48              646   
280  01M140     8  2011  All Students             61              665   
346  01M184     8  2011  All Students             49              727   
388  01M188     8  2011  All Students             49              658   
411  01M292     8  2011  All Students             49              650   

    Level 1 # Level 1 % Level 2 # Level 2 % Level 3 # Level 3 % Level 4 #  \
111        15      31.3        22      45.8        11      22.9         0   
280         1       1.6        43      70.5        17      27.9         0   
346         0         0         0         0         5      10.2        44   
388        10      20.4        26      53.1        10      20.4         3   
411        15      30.6        25        51         7      14.3         2   

    Level 4 % Level 3+4 # Level 3+4 %  
111         0          11        22.9  
280         0          17        27.9  
346      89.8          49         100  
388       6.1          13        26.5  
411       4.1           9        18.4  
class_size
      DBN  CSD  NUMBER OF STUDENTS / SEATS FILLED  NUMBER OF SECTIONS  \
0  01M292    1                            88.0000            4.000000   
1  01M332    1                            46.0000            2.000000   
2  01M378    1                            33.0000            1.000000   
3  01M448    1                           105.6875            4.750000   
4  01M450    1                            57.6000            2.733333   

   AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  SIZE OF LARGEST CLASS  \
0           22.564286                   18.50              26.571429   
1           22.000000                   21.00              23.500000   
2           33.000000                   33.00              33.000000   
3           22.231250                   18.25              27.062500   
4           21.200000                   19.40              22.866667   

   SCHOOLWIDE PUPIL-TEACHER RATIO  
0                             NaN  
1                             NaN  
2                             NaN  
3                             NaN  
4                             NaN  
ap_results
      DBN                             SchoolName  AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.             39.0   
1  01M450                 EAST SIDE COMMUNITY HS             19.0   
2  01M515                    LOWER EASTSIDE PREP             24.0   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH            255.0   
4  02M296  High School of Hospitality Management              NaN   

   Total Exams Taken  Number of Exams with scores 3 4 or 5  
0               49.0                                  10.0  
1               21.0                                   NaN  
2               26.0                                  24.0  
3              377.0                                 191.0  
4                NaN                                   NaN  
grad_outcomes
     Demographic     DBN                            School Name Cohort  \
3   Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL   2006   
10  Total Cohort  01M448    UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   2006   
17  Total Cohort  01M450             EAST SIDE COMMUNITY SCHOOL   2006   
24  Total Cohort  01M509                MARTA VALLE HIGH SCHOOL   2006   
31  Total Cohort  01M515  LOWER EAST SIDE PREPARATORY HIGH SCHO   2006   

    Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
3             78              43                      55.1                36   
10           124              53                      42.7                42   
17            90              70                      77.8                67   
24            84              47                        56                40   
31           193             105                      54.4                91   

   Total Regents - % of cohort Total Regents - % of grads  \
3                         46.2                       83.7   
10                        33.9                       79.2   
17          74.400000000000006                       95.7   
24                        47.6                       85.1   
31                        47.2                       86.7   

              ...            Regents w/o Advanced - n  \
3             ...                                  36   
10            ...                                  34   
17            ...                                  67   
24            ...                                  23   
31            ...                                  22   

   Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads  \
3                                46.2                              83.7   
10                               27.4                              64.2   
17                 74.400000000000006                              95.7   
24                               27.4                              48.9   
31                               11.4                                21   

   Local - n Local - % of cohort Local - % of grads Still Enrolled - n  \
3          7                   9               16.3                 16   
10        11                 8.9               20.8                 46   
17         3                 3.3                4.3                 15   
24         7   8.300000000000001               14.9                 25   
31        14                 7.3               13.3                 53   

   Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort  
3                          20.5              11                      14.1  
10                         37.1              20        16.100000000000001  
17                         16.7               5                       5.6  
24                         29.8               5                         6  
31                         27.5              35        18.100000000000001  

[5 rows x 23 columns]
demographics
       DBN                                              Name  schoolyear  \
6   01M015  P.S. 015 ROBERTO CLEMENTE                           20112012   
13  01M019  P.S. 019 ASHER LEVY                                 20112012   
20  01M020  PS 020 ANNA SILVER                                  20112012   
27  01M034  PS 034 FRANKLIN D ROOSEVELT                         20112012   
35  01M063  PS 063 WILLIAM MCKINLEY                             20112012   

   fl_percent  frl_percent  total_enrollment prek    k grade1 grade2  \
6         NaN         89.4               189   13   31     35     28   
13        NaN         61.5               328   32   46     52     54   
20        NaN         92.5               626   52  102    121     87   
27        NaN         99.7               401   14   34     38     36   
35        NaN         78.9               176   18   20     30     21   

      ...     black_num black_per hispanic_num hispanic_per white_num  \
6     ...            63      33.3          109         57.7         4   
13    ...            81      24.7          158         48.2        28   
20    ...            55       8.8          357         57.0        16   
27    ...            90      22.4          275         68.6         8   
35    ...            41      23.3          110         62.5        15   

   white_per male_num male_per female_num female_per  
6        2.1     97.0     51.3       92.0       48.7  
13       8.5    147.0     44.8      181.0       55.2  
20       2.6    330.0     52.7      296.0       47.3  
27       2.0    204.0     50.9      197.0       49.1  
35       8.5     97.0     55.1       79.0       44.9  

[5 rows x 38 columns]
survey
      DBN  rr_s  rr_t  rr_p    N_s   N_t    N_p  saf_p_11  com_p_11  eng_p_11  \
0  01M015   NaN    88    60    NaN  22.0   90.0       8.5       7.6       7.5   
1  01M019   NaN   100    60    NaN  34.0  161.0       8.4       7.6       7.6   
2  01M020   NaN    88    73    NaN  42.0  367.0       8.9       8.3       8.3   
3  01M034  89.0    73    50  145.0  29.0  151.0       8.8       8.2       8.0   
4  01M063   NaN   100    60    NaN  23.0   90.0       8.7       7.9       8.1   

      ...      eng_t_11  aca_t_11  saf_s_11  com_s_11  eng_s_11  aca_s_11  \
0     ...           7.6       7.9       NaN       NaN       NaN       NaN   
1     ...           8.9       9.1       NaN       NaN       NaN       NaN   
2     ...           6.8       7.5       NaN       NaN       NaN       NaN   
3     ...           6.8       7.8       6.2       5.9       6.5       7.4   
4     ...           7.8       8.1       NaN       NaN       NaN       NaN   

   saf_tot_11  com_tot_11  eng_tot_11  aca_tot_11  
0         8.0         7.7         7.5         7.9  
1         8.5         8.1         8.2         8.4  
2         8.2         7.3         7.5         8.0  
3         7.3         6.7         7.1         7.9  
4         8.5         7.6         7.9         8.0  

[5 rows x 23 columns]

Combining the datasets

In [28]:
# Combine the datasets using the DBN column
sep_data_names = [k for k in data]
sep_data = [data[k] for k in sep_data_names]
combined = sep_data[0]
print(sep_data_names)

for i, f in enumerate(sep_data[1:]):
    name = sep_data_names[i+1]
    print(name, f.shape)
    # number of non-unique DBN numbers in each dataset
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "left"
    if name in ["grad_outcomes","ap_results"]:
        join_type = "outer"
    combined = combined.merge(f, on="DBN", how=join_type)
['sat_results', 'hs_directory', 'math_results', 'class_size', 'ap_results', 'grad_outcomes', 'demographics', 'survey']
hs_directory (435, 66)
0
math_results (499, 16)
0
class_size (583, 8)
0
ap_results (258, 5)
1
grad_outcomes (405, 23)
0
demographics (1509, 38)
0
survey (1702, 23)
0
In [29]:
combined.shape
Out[29]:
(447, 179)
In [30]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    combined[col] = pd.to_numeric(combined[col])

combined[cols] = combined[cols].fillna(value=0)
In [31]:
# Generate school_dist column with the school distrcit for later plotting
combined["school_dist"] = combined["DBN"].apply(lambda x: x[:2])
In [32]:
# Fill in missing values in the columns with the mean of the column
combined = combined.fillna(combined.mean())
combined.sample(5)
Out[32]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score total_sat_score school_name borough building_code ... aca_t_11 saf_s_11 com_s_11 eng_s_11 aca_s_11 saf_tot_11 com_tot_11 eng_tot_11 aca_tot_11 school_dist
214 12X690 MONROE ACADEMY FOR BUSINESS/LAW 31 375.0 387.0 376.0 1138.0 NaN NaN NaN ... 6.0 5.5 5.3 5.5 6.2 5.9 5.5 5.5 6.4 12
260 16K594 GOTHAM PROFESSIONAL ARTS ACADEMY 36 370.0 372.0 351.0 1093.0 Gotham Professional Arts Academy Brooklyn K040 ... 6.3 5.8 5.9 6.6 7.5 6.0 6.2 6.5 7.3 16
220 13K419 SCIENCE SKILLS CENTER HIGH SCHOOL FOR SCIENCE,... 92 399.0 417.0 395.0 1211.0 Science Skills Center High School for Science,... Brooklyn K805 ... 7.7 6.0 6.4 6.6 7.2 7.6 7.8 7.9 8.2 13
183 10X696 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 92 636.0 648.0 636.0 1920.0 High School of American Studies at Lehman College Bronx X905 ... 9.0 8.1 7.1 7.8 8.6 8.7 7.5 8.0 8.7 10
180 10X549 DISCOVERY HIGH SCHOOL 55 372.0 379.0 364.0 1115.0 Discovery High School Bronx X430 ... 8.2 6.2 5.5 6.1 6.8 7.6 6.7 6.8 7.6 10

5 rows × 180 columns

Compute correlations

Use the corr method to compute correlations for total_sat_score. A value closer to 1 means there's a stronger positive correlation, and a value closer to -1 means there's a stronger negative correlation. A value close to 0 means a weak correlation.

In [33]:
# Compute correlations for the total SAT score
combined.corr()['total_sat_score'].sort_values(ascending = False)
Out[33]:
total_sat_score                         1.000000
SAT Writing Avg. Score                  0.980973
SAT Critical Reading Avg. Score         0.974762
SAT Math Avg. Score                     0.953016
white_per                               0.646541
AP Test Takers                          0.559963
Total Exams Taken                       0.549188
Number of Exams with scores 3 4 or 5    0.548252
asian_per                               0.541135
asian_num                               0.471722
white_num                               0.455099
N_p                                     0.431544
N_s                                     0.425939
female_num                              0.392237
NUMBER OF STUDENTS / SEATS FILLED       0.390003
total_students                          0.389508
AVERAGE CLASS SIZE                      0.387522
total_enrollment                        0.373910
NUMBER OF SECTIONS                      0.355075
male_num                                0.334403
SIZE OF LARGEST CLASS                   0.319454
saf_t_11                                0.299545
N_t                                     0.297445
Total Cohort                            0.291535
rr_s                                    0.288729
SIZE OF SMALLEST CLASS                  0.277447
saf_tot_11                              0.275435
aca_s_11                                0.275310
saf_s_11                                0.262695
aca_tot_11                              0.172161
                                          ...   
CSD                                     0.050962
hispanic_num                            0.048752
eng_t_11                                0.047796
Census Tract                            0.044945
BIN                                     0.042453
black_num                               0.036349
BBL                                     0.034213
aca_p_11                                0.031637
eng_p_11                                0.029995
rr_t                                    0.013452
grade_span_min                         -0.040540
Community Board                        -0.060451
postcode                               -0.073439
com_p_11                               -0.087261
Council District                       -0.087732
male_per                               -0.106793
lat                                    -0.118670
ell_num                                -0.130756
lon                                    -0.135275
black_per                              -0.307403
hispanic_per                           -0.357531
ell_percent                            -0.373996
sped_percent                           -0.417852
frl_percent                            -0.715912
grade_span_max                               NaN
expgrade_span_min                            NaN
expgrade_span_max                            NaN
Year                                         NaN
SCHOOLWIDE PUPIL-TEACHER RATIO               NaN
schoolyear                                   NaN
Name: total_sat_score, Length: 72, dtype: float64

Set the context

In [48]:
# Set up a map centered on NYC and add a marker on the map for each 
# high school in the city
import folium
from folium import plugins as fp

schools_map = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
marker_cluster = fp.MarkerCluster().add_to(schools_map)
for name, row in combined.iterrows():
    folium.Marker([row["lat"], row["lon"]], popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map
Out[48]:
In [49]:
# Create heatmap to visualize where the biggest concentrations of schools are
schools_heatmap = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
schools_heatmap.add_child(fp.HeatMap([row.lat, row.lon] for name, row in combined.iterrows()))
schools_heatmap.save("heatmap.html")
schools_heatmap
Out[49]:
In [50]:
# Compute SAT score by school district and convert school_dist to match geographic district data
district_data = combined.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))
In [57]:
# Create function for district map
def show_district_map(col):
    geo_path = '/Users/swagbird/Desktop/Data Science/Projects/Dataquest Portfolio Series/Storytelling with Data/data/districts.geojson'
    district_map = folium.Map(location=[district_data['lat'].mean(), district_data['lon'].mean()], zoom_start=10)

    folium.Choropleth(geo_data=geo_path,
                      name='choropleth',
                      data=district_data,
                      columns = ['school_dist', col],
                      key_on='feature.properties.school_dist',
                      fill_color='BuGn',
                      fill_opacity=0.7,
                      line_opacity=0.2
                     ).add_to(district_map)
    district_map.save("districts_{0}.html".format(col))
    return district_map
In [58]:
show_district_map("total_sat_score")
Out[58]:

Exploring enrollment and SAT scores

In [59]:
import matplotlib.pyplot as plt

combined.plot.scatter(x='total_enrollment', y='total_sat_score')
plt.show()
In [60]:
combined[(combined['total_enrollment'] < 1000) & (combined['total_sat_score'] < 1000)]["School Name"]
Out[60]:
109      INTERNATIONAL COMMUNITY HIGH SCHOOL
148                                      NaN
149          BRONX INTERNATIONAL HIGH SCHOOL
163    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
165    INTERNATIONAL SCHOOL FOR LIBERAL ARTS
208                                      NaN
211            HIGH SCHOOL OF WORLD CULTURES
222       BROOKLYN INTERNATIONAL HIGH SCHOOL
253                      PACIFIC HIGH SCHOOL
264    INTERNATIONAL HIGH SCHOOL AT PROSPECT
279               IT TAKES A VILLAGE ACADEMY
298                MULTICULTURAL HIGH SCHOOL
329     ASPIRATIONS DIPLOMA PLUS HIGH SCHOOL
335    PAN AMERICAN INTERNATIONAL HIGH SCHOO
Name: School Name, dtype: object
In [61]:
combined.plot.scatter(x='ell_percent', y='total_sat_score')
plt.show()
In [62]:
show_district_map("ell_percent")
Out[62]:
In [64]:
combined.corr()["total_sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c4fa4e0>
In [65]:
combined.corr()["total_sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x1198997f0>
In [66]:
show_district_map("hispanic_per")
Out[66]:
In [67]:
combined.corr()["total_sat_score"][["male_per", "female_per"]].plot.bar()
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c30aeb8>
In [69]:
combined.plot.scatter(x='female_per', y='total_sat_score')
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x11debb438>
In [70]:
combined[(combined['female_per'] > 65) & (combined['total_sat_score'] > 1400)]['School Name']
Out[70]:
8                    BARD HIGH SCHOOL EARLY COLLEGE
25            PROFESSIONAL PERFORMING ARTS HIGH SCH
30                    ELEANOR ROOSEVELT HIGH SCHOOL
44                     TALENT UNLIMITED HIGH SCHOOL
74             FIORELLO H. LAGUARDIA HIGH SCHOOL OF
352                     TOWNSEND HARRIS HIGH SCHOOL
395    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
Name: School Name, dtype: object
In [71]:
combined["ap_avg"] = combined["AP Test Takers "] / combined["total_enrollment"]
combined.plot.scatter(x='ap_avg', y='total_sat_score')
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x11751f588>
In [72]:
combined[(combined['ap_avg'] > 0.3) & (combined['total_sat_score'] > 1700)]['School Name']
Out[72]:
30             ELEANOR ROOSEVELT HIGH SCHOOL
41                    STUYVESANT HIGH SCHOOL
176             BRONX HIGH SCHOOL OF SCIENCE
183    HIGH SCHOOL OF AMERICAN STUDIES AT LE
221           BROOKLYN TECHNICAL HIGH SCHOOL
352              TOWNSEND HARRIS HIGH SCHOOL
380    QUEENS HIGH SCHOOL FOR THE SCIENCES A
410      STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: School Name, dtype: object
In [ ]: